Abstract

We prove the statistical consistency of kernel Partial Least Squares
Regression applied to a bounded regression learning problem on a
reproducing kernel Hilbert space. Partial Least Squares differs from
well-known classical approaches such as Ridge Regression or Principal
Components Regression: it is neither defined as the solution of a global
cost minimization procedure over a fixed model, nor is it a linear
estimator. Instead, approximate solutions are constructed by projections
onto a nested set of data-dependent subspaces. To prove consistency,
we exploit the known fact that Partial Least Squares is equivalent to 
the conjugate gradient algorithm in combination with early stopping. 
The choice of the stopping rule (number of iterations) is a crucial 
point. We study two empirical stopping rules. The first one monitors 
the estimation error in each iteration step of Partial Least Squares, 
and the second one estimates the empirical complexity in terms of a 
condition number. Both stopping rules lead to universally consistent 
estimators provided the kernel is universal. 



1 INTRODUCTION 

Partial Least Squares (PLS) (Wold, 1975; Wold et al., 1984) is a supervised
dimensionality reduction technique. It iteratively constructs an orthogonal
set of m latent components from the predictor variables which have maximal
covariance with the response variable. This low-dimensional representation
of the data is then used for prediction by fitting a linear regression model to
the response and the latent components. The number m of latent components
acts as a regularizer. In contrast to Principal Components Regression, the
latent components are response-dependent. In combination with the kernel
trick (Schölkopf et al., 1998), kernel PLS performs nonlinear dimensionality
reduction and regression (Rosipal and Trejo, 2001).

While PLS has proven to be successful in a wide range of applications,
theoretical studies of PLS, such as its consistency, are less widespread.
This is perhaps due to the fact that, in contrast to many standard methods
(e.g. Ridge Regression or Principal Components Regression), PLS is not
defined as the solution of a global cost function, nor is it a linear estimator
in the sense that the fitted values depend linearly on the response variable.
Instead, PLS minimizes the least squares criterion on a nested set of
data-dependent subspaces (i.e., the subspaces defined by the latent components).
Therefore, results obtained for linear estimators are not straightforward to
extend to PLS. Recent work (Naik and Tsai, 2000; Chun and Keles, 2009)
studies the model consistency of PLS in the linear case. These results assume
that the target function depends on a finite known number ℓ of orthogonal
latent components and that PLS is run for at least ℓ steps (without early
stopping). In this configuration, Chun and Keles (2009) obtain inconsistency
results in scenarios where the dimensionality can grow with the number of
data. This underscores that the choice of the regularization (or early
stopping) term m is important and that it has to be selected in a
data-dependent manner.

Here, we prove the universal prediction consistency of kernel PLS in the
infinite dimensional case. In particular, we define suitable data-dependent
stopping criteria for the number of PLS components to ensure consistency.
For the derivation of our results, we capitalize on the close connection of
PLS and the conjugate gradient algorithm (Hestenes and Stiefel, 1952) for
the solution of linear equations: the PLS solution with m latent components
is equivalent to the conjugate gradient algorithm applied to the set of normal
equations in combination with early stopping after m iterations. We use this
equivalence to define the population version of kernel PLS. We then proceed
in three steps: (i) We show that population kernel PLS converges to
the true regression function. (ii) We bound the difference between empirical
and population PLS, which is low as long as the number of iterations does
not grow too fast. We ensure this via two different stopping criteria. The
first one monitors the error in each iteration step of PLS, and the second
one estimates the empirical complexity in terms of a condition number. (iii)
Combining the results from the two previous steps, our stopping rules lead
to universally consistent estimators provided the kernel is universal. We
emphasize that neither stopping rule depends on any prior knowledge of
the target function; both depend only on observable quantities.



2 BACKGROUND 

We study a regression problem based on a joint probability distribution
$P(X, Y)$ on $\mathcal{X} \times \mathcal{Y}$. The task is to estimate the
true regression function

$$\bar{f}(x) = E[Y \mid X = x] \tag{1}$$

based on a finite number $n$ of observations
$(x_1, y_1), \ldots, (x_n, y_n) \in \mathcal{X} \times \mathcal{Y}$. As a
general convention, population quantities defined from the perfect knowledge
of the distribution $P$ will be denoted with a bar, empirical quantities without.
We assume that $\bar{f}$ belongs to the space of $P_X$-square-integrable functions
$\mathcal{L}_2(P_X)$, where $P_X$ denotes the $X$-marginal distribution. The
vector $y \in \mathbb{R}^n$ represents the $n$ centered response observations
$y_1, \ldots, y_n$.

As we consider kernel techniques to estimate the true regression function
(1), we map the data to a reproducing kernel Hilbert space $\mathcal{H}_k$
with bounded kernel $k$, via the canonical kernel mapping

$$\phi : \mathcal{X} \to \mathcal{H}_k, \quad x \mapsto \phi(x) = k(x, \cdot).$$

In the remainder, we make the following assumptions:

(B) boundedness of the data and kernel: $Y \in [-1, 1]$ almost surely and
$\sup_{x \in \mathcal{X}} k(x, x) \le 1$.

(U) universality of the kernel: for any distribution $P_X$ on $\mathcal{X}$,
$\mathcal{H}_k$ is dense in $\mathcal{L}_2(P_X)$.



2.1 LEARNING AS AN INVERSE PROBLEM 



We very briefly review the interpretation of kernel-based regression as a
statistical inverse problem, as introduced in De Vito et al. (2006). Let us
denote the inclusion operator of the kernel space into $\mathcal{L}_2(P_X)$ by

$$\bar{T} : \mathcal{H}_k \to \mathcal{L}_2(P_X), \quad g \mapsto g.$$

This operator maps a function to itself, but between two Hilbert spaces which
differ with respect to their geometry: the inner product of $\mathcal{H}_k$ is
defined by the kernel function $k$, while the inner product of
$\mathcal{L}_2(P_X)$ depends on the data generating distribution. The adjoint
operator of $\bar{T}$ is defined as usual as the unique operator satisfying the
condition

$$\langle f, \bar{T} g \rangle_{\mathcal{L}_2(P_X)} = \langle \bar{T}^* f, g \rangle_{\mathcal{H}_k}$$

for all $f \in \mathcal{L}_2(P_X)$, $g \in \mathcal{H}_k$. It can be checked
from this definition and the reproducing property of the kernel that
$\bar{T}^*$ coincides with the kernel integral operator from
$\mathcal{L}_2(P_X)$ to $\mathcal{H}_k$,

$$\bar{T}^* g = \int k(\cdot, x')\, g(x')\, dP(x') = E[k(X, \cdot)\, g(X)].$$

Finally, the operator $\bar{S} = \bar{T}^* \bar{T} : \mathcal{H}_k \to \mathcal{H}_k$
is the covariance operator for the random variable $\phi(X)$:

$$\bar{S} g = E[k(X, \cdot)\, g(X)] = E[\phi(X) \langle \phi(X), g \rangle].$$

Learning in the kernel space $\mathcal{H}_k$ can be cast (formally) as the
inverse problem $\bar{T} g = \bar{f}$, which yields (after applying $\bar{T}^*$
to both sides) the so-called normal equation

$$\bar{S} g = \bar{T}^* \bar{f}. \tag{2}$$

The above equation has a solution if and only if $\bar{f}$ can be represented
as a function of $\mathcal{H}_k$, i.e. $\bar{f} \in \bar{T}\mathcal{H}_k$;
however, even if $\bar{f} \notin \bar{T}\mathcal{H}_k$, the above formal
equation can be used as a motivation to use regularization algorithms coming
from the inverse problems literature in order to find an approximation $g$ of
$\bar{f}$ belonging to the space $\mathcal{H}_k$. In a learning problem, neither
the left-hand nor the right-hand side of (2) is known, and we only observe
empirical quantities, which can be interpreted as "perturbed" versions of the
population equations wherein $P_X$ is replaced by its empirical counterpart
$P_{X,n}$ and $\bar{f}$ by $y$. Note that the space $\mathcal{L}_2(P_{X,n})$ is
isometric to $\mathbb{R}^n$ with the usual Euclidean product, wherein a function
$g \in \mathcal{L}_2(P_{X,n})$ is mapped to the $n$-vector
$(g(x_1), \ldots, g(x_n))$. The empirical integral operator
$T^* : \mathcal{L}_2(P_{X,n}) \to \mathcal{H}_k$ is then given by

$$T^* g = \frac{1}{n} \sum_{i=1}^{n} g(x_i)\, k(x_i, \cdot). \tag{3}$$

The empirical covariance operator $S = T^* T$ is defined similarly, but on the
input space $\mathcal{H}_k$. Note that if the Hilbert space $\mathcal{H}_k$ is
finite-dimensional, the operator $S$ corresponds to left multiplication with
the empirical covariance matrix. The perturbed, empirical version of the normal
equation (2) is then defined as

$$S g = T^* y. \tag{4}$$

Again, if $\mathcal{H}_k$ is finite-dimensional, the right-hand side corresponds
to the covariance between predictors and response. In general, equation (4) is
ill-posed, and regularization techniques are needed. Popular examples are
Kernel Ridge Regression (which corresponds to Tikhonov regularization in inverse
problems) or $L_2$-Boosting (corresponding to Landweber iterations).
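The empirical operators above become plain matrix operations once elements of the span of the kernel functions are stored as coefficient vectors. The following is a minimal NumPy sketch under assumptions that are not part of the paper: a Gaussian (RBF) kernel, synthetic data, the helper names `rbf_gram` and `apply_S`, and the regularization value `lam` are all illustrative. It solves the Tikhonov-regularized version of the empirical normal equation (4) and looks at the conditioning of the unregularized problem.

```python
import numpy as np

def rbf_gram(X, gamma=1.0):
    """Gram matrix G[i, j] = exp(-gamma * ||x_i - x_j||^2), so k(x, x) = 1."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * sq)

rng = np.random.default_rng(0)
n = 50
X = rng.uniform(-1.0, 1.0, size=(n, 1))          # synthetic inputs (assumption)
y = np.clip(np.sin(3.0 * X[:, 0]) + 0.1 * rng.standard_normal(n), -1.0, 1.0)
y = y - y.mean()                                  # centered responses

G = rbf_gram(X)

# An element g = sum_i a_i k(x_i, .) of H_k is stored as its coefficient vector a:
#   T g   <->  G @ a       (the evaluations g(x_1), ..., g(x_n))
#   T* v  <->  v / n       (coefficients of (1/n) sum_i v_i k(x_i, .), eq. (3))
#   S g   <->  G @ a / n   (coefficients of T* T g)
def apply_S(a):
    return G @ a / n

rhs = y / n                                       # coefficients of T* y

# Tikhonov / kernel ridge: (S + lam * Id) g = T* y, i.e. (G / n + lam * I) a = y / n.
lam = 1e-2                                        # illustrative value
a_ridge = np.linalg.solve(G / n + lam * np.eye(n), rhs)

# Residual of the regularized normal equation (numerically zero).
ridge_residual = apply_S(a_ridge) + lam * a_ridge - rhs
# The unregularized equation is badly conditioned: inspect cond(G).
cond_G = np.linalg.cond(G)
```

In this coordinate system the unregularized equation (4) reduces to `G @ a = y`, whose conditioning is that of the Gram matrix; this is one way to see why regularization is needed.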



2.2 PLS AND CONJUGATE GRADIENTS 

PLS is generally described as a greedy iterative method that produces a
sequence of latent components on which the data is projected. In contrast with
PCA components, which maximize the variance, in PLS the components are
defined to have maximal covariance with the response $y$. In particular, the
latent components depend on the response. For prediction, the response $y$
is projected on these latent components.

It is however a known fact that the output of the $m$-th step of PLS is
actually equivalent to a conjugate gradient (CG) algorithm applied to the
normal equation (4), stopped early at step $m$ (see e.g. Phatak and de Hoog,
2002, for a detailed overview). This has been established for traditional PLS,
i.e. $\mathcal{X} = \mathbb{R}^d$ and the linear kernel is used, but for the
kernel PLS (KPLS) algorithm introduced by Rosipal and Trejo (2001) the exact
same analysis is valid as well. For consistency with the remainder of our
analysis, we therefore directly present KPLS as a CG method.

For the self-adjoint operator $S$ and for $T^* y$, we define the associated
Krylov space of order $m$ as

$$\mathcal{K}_m(T^* y, S) = \mathrm{span}\{T^* y, S T^* y, \ldots, S^{m-1} T^* y\} \subset \mathcal{H}_k.$$

In other words, $\mathcal{K}_m(T^* y, S)$ is the linear subspace of
$\mathcal{H}_k$ of all elements of the form $q(S) T^* y$, where $q$ ranges over
the real polynomials of degree at most $m - 1$. The $m$-th step of the CG method
as applied to the normal equation (4) is simply defined (see e.g. Engl et al.,
1996, chap. 7) as the element $g_m \in \mathcal{H}_k$ that minimizes the least
squares criterion over the Krylov space, i.e.

$$g_m = \arg\min_{g \in \mathcal{K}_m(T^* y, S)} \|y - T g\|^2. \tag{5}$$

Here, we recall that for any function $g \in \mathcal{H}_k$, the mapping $Tg$
of $g$ into $\mathcal{L}_2(P_{X,n})$ can be equivalently represented as the
$n$-vector $(g(x_1), \ldots, g(x_n))$. Observe that since the Krylov space
itself depends on the data (and in particular on the response variable $y$),
(5) is not a linear estimator.

An extremely important property of CG is that the above optimization
problem can be solved exactly by a simple iterative algorithm which only
requires forward applications of the operator $S$, sums of elements in
$\mathcal{H}_k$, and scalar multiplications and divisions (algorithm 1).

Algorithm 1 Empirical KPLS (in $\mathcal{H}_k$)

Initialize: $g_0 = 0$; $u_0 = T^* y$; $d_0 = u_0$
for $m = 0, \ldots, (m_{\max} - 1)$ do
    $\alpha_m = \|u_m\|^2 / \langle d_m, S d_m \rangle$
    $g_{m+1} = g_m + \alpha_m d_m$ (update)
    $u_{m+1} = u_m - \alpha_m S d_m$ (residuals)
    $\beta_m = \|u_{m+1}\|^2 / \|u_m\|^2$
    $d_{m+1} = u_{m+1} + \beta_m d_m$ (new basis vector)
end for
Return: approximate solution $g_{m_{\max}}$

In fact, the CG algorithm iteratively constructs a basis
$d_0, \ldots, d_{m-1}$ of the Krylov space $\mathcal{K}_m(T^* y, S)$. The
sequence of $g_m \in \mathcal{H}_k$ is constructed in such a way that the
residuals $u_m = T^* y - S g_m$ are pairwise orthogonal in $\mathcal{H}_k$,
i.e. $\langle u_j, u_k \rangle = 0$ for $j \neq k$, while the constructed basis
is $S$-orthogonal (or equivalently, uncorrelated), i.e.
$\langle d_j, S d_k \rangle = 0$ for $j \neq k$.

Note that the above algorithm is written entirely in $\mathcal{H}_k$; this form
is convenient for the theoretical analysis to come. In practice, since all
involved elements belong to $\mathrm{span}\{k(x_i, \cdot)\}_{1 \le i \le n}$, a
weighted kernel expansion is used to represent these elements, and corresponding
weight update equations using the kernel matrix can be derived (see Rosipal and
Trejo, 2001).
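As an illustration of the weighted kernel expansion mentioned above, here is a minimal NumPy sketch of the CG iteration in coefficient form, together with a check against the variational characterization (5). It is not the weight-update scheme of Rosipal and Trejo; the function names `kpls_cg` and `krylov_least_squares`, the RBF kernel and the synthetic data are assumptions made for the example. An element $g = \sum_i a_i k(x_i, \cdot)$ is stored as the vector $a$, so that $\langle g, g' \rangle_{\mathcal{H}_k} = a^\top G a'$ and $S$ acts as $a \mapsto G a / n$, with $G$ the Gram matrix.

```python
import numpy as np

def kpls_cg(G, y, m_max):
    """Run m_max CG steps on S g = T* y in kernel-coefficient form.

    Returns the coefficient vectors of the iterates g_0, ..., g_{m_max}.
    """
    n = len(y)
    g = np.zeros(n)
    u = y / n                     # u_0 = T* y
    d = u.copy()                  # d_0 = u_0
    iterates = [g.copy()]
    for _ in range(m_max):
        Sd = G @ d / n
        u_norm2 = u @ G @ u                  # ||u_m||^2 in H_k
        alpha = u_norm2 / (d @ G @ Sd)       # ||u_m||^2 / <d_m, S d_m>
        g = g + alpha * d                    # update
        u = u - alpha * Sd                   # residuals
        beta = (u @ G @ u) / u_norm2
        d = u + beta * d                     # new basis vector
        iterates.append(g.copy())
    return iterates

def krylov_least_squares(G, y, m):
    """Fitted values of the minimizer of ||y - T g||^2 over the Krylov space."""
    n = len(y)
    basis = [y / n]                          # T* y
    for _ in range(m - 1):
        basis.append(G @ basis[-1] / n)      # next Krylov vector S^j T* y
    TB = G @ np.column_stack(basis)          # evaluations of the basis at the x_i
    w, *_ = np.linalg.lstsq(TB, y, rcond=None)
    return TB @ w

rng = np.random.default_rng(1)
n = 40
X = rng.uniform(-1.0, 1.0, size=(n, 2))      # synthetic inputs (assumption)
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
G = np.exp(-2.0 * sq)                        # RBF Gram matrix
y = np.sin(2.0 * X[:, 0]) * X[:, 1] + 0.05 * rng.standard_normal(n)
y = y - y.mean()

m = 3
fitted_cg = G @ kpls_cg(G, y, m)[m]          # T g_m from the CG recursion
fitted_ls = krylov_least_squares(G, y, m)    # T g_m from eq. (5) directly
max_gap = np.max(np.abs(fitted_cg - fitted_ls))
```

The two sets of fitted values should agree up to rounding, which is the equivalence between the iterative description and the least squares problem over the Krylov space. Only a small number of steps is used in the check because the explicit Krylov basis becomes ill-conditioned quickly.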



3 POPULATION VERSION OF KPLS 

Using the CG interpretation of KPLS, we can define its population version 
as follows: 

Definition 1. Denote by $\bar{g}_m \in \mathcal{H}_k$ the output of algorithm 1
after $m$ iterations, if we replace the empirical operator $S$ and the vector
$T^* y$ by their population versions $\bar{S}$ and $\bar{T}^* \bar{f}$,
respectively. We define population KPLS with $m$ components as
$\bar{f}_m = \bar{T} \bar{g}_m \in \mathcal{L}_2(P_X)$.

We emphasize again that $\bar{g}_m \in \mathcal{H}_k$ and
$\bar{f}_m \in \mathcal{L}_2(P_X)$ are identical as functions from
$\mathcal{X}$ to $\mathbb{R}$, but seen as elements of Hilbert spaces with a
different geometry (norm). The first step in our consistency proof is to show
that population KPLS $\bar{f}_m$ converges to $\bar{f}$ (with respect to the
$\mathcal{L}_2(P_X)$ norm) if $m$ tends to $\infty$. Note that even if
$\bar{f} \notin \bar{T}\mathcal{H}_k$, we can still show that
$\bar{T}\bar{g}_m$ converges to the projection of $\bar{f}$ onto the closure of
$\mathcal{H}_k$ in $\mathcal{L}_2(P_X)$. If the kernel is universal (U), this
projection is $\bar{f}$ itself and this implies asymptotic consistency of the
population version. We will assume for simplicity that

(I) the true regression function $\bar{f}$ has infinitely many components in
its decomposition over the eigenfunctions of $\bar{S}$,

which implies that the population version of the algorithm can theoretically
be run indefinitely without exiting. If this condition is not satisfied, the
population algorithm stops after a finite number of steps $\kappa$, at which
point it holds that $\bar{f}_\kappa = \bar{f}$, so that the rest of our analysis
also holds in that case with only minor modifications.

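Proposition 1. The kernel operator of $k$ is defined as
$\bar{K} = \bar{T}\bar{T}^* : \mathcal{L}_2(P_X) \to \mathcal{L}_2(P_X)$.
We denote by $\mathcal{P}$ the orthogonal projection onto the closure of the
range of $\bar{K}$ in $\mathcal{L}_2(P_X)$. Then, recalling
$\bar{f}_m = \bar{T}\bar{g}_m$ where $\bar{g}_m$ is the output of the
$m$-th iteration of the conjugate gradient algorithm applied to the population
normal equation (2), it holds that $\bar{f}_m = \bar{K}\bar{q}_m(\bar{K})\bar{f}$,
where $\bar{q}_m$ is a polynomial of degree $\le m - 1$ fulfilling

$$\bar{q}_m = \arg\min_{\deg q \le m-1} \|\mathcal{P}\bar{f} - \bar{K} q(\bar{K}) \bar{f}\|^2_{\mathcal{L}_2(P_X)}.$$

Proof. The minimization property (5), when written in the population case,
yields

$$\bar{q}_m = \arg\min_{\deg q \le m-1} \|\bar{f} - \bar{T} q(\bar{S}) \bar{T}^* \bar{f}\|^2.$$

Furthermore, for all polynomials $q$,

$$\begin{aligned}
\|\bar{f} - \bar{T} q(\bar{S}) \bar{T}^* \bar{f}\|^2
&= \|\bar{f} - \bar{K} q(\bar{K}) \bar{f}\|^2 \\
&= \|\mathcal{P}(\bar{f} - \bar{K} q(\bar{K}) \bar{f})\|^2 + \|(I - \mathcal{P})(\bar{f} - \bar{K} q(\bar{K}) \bar{f})\|^2 \\
&= \|\mathcal{P}\bar{f} - \bar{K} q(\bar{K}) \bar{f}\|^2 + \|(I - \mathcal{P})\bar{f}\|^2 .
\end{aligned}$$

As the second term above does not depend on the polynomial $q$, this yields
the announced result. □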
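The two identities used in this proof can be checked numerically in a finite-dimensional stand-in. The sketch below is illustrative only: a random matrix `T` plays the role of the inclusion operator, and `poly_at` is a hypothetical helper. It verifies that $T q(S) T^* = K q(K)$ for $S = T^* T$, $K = T T^*$, and that the squared error splits into a part inside the range of $K$ and a part orthogonal to it.

```python
import numpy as np

rng = np.random.default_rng(2)
T = rng.standard_normal((6, 4)) / 3.0   # finite-dimensional stand-in for T
S = T.T @ T                             # S = T* T  (4 x 4)
K = T @ T.T                             # K = T T*  (6 x 6)

def poly_at(coeffs, A):
    """Evaluate q(A) = sum_j coeffs[j] * A^j for a square matrix A."""
    out = np.zeros_like(A)
    power = np.eye(A.shape[0])
    for c in coeffs:
        out = out + c * power
        power = power @ A
    return out

q = [0.7, -1.3, 0.4]                    # an arbitrary polynomial of degree 2
lhs = T @ poly_at(q, S) @ T.T           # T q(S) T*
rhs = K @ poly_at(q, K)                 # K q(K)
identity_gap = np.max(np.abs(lhs - rhs))

# Pythagoras with the orthogonal projection P onto range(K) = range(T).
f = rng.standard_normal(6)
U, _, _ = np.linalg.svd(T, full_matrices=False)   # columns of U span range(T)
P = U @ U.T
residual = f - rhs @ f
split = np.sum((P @ f - rhs @ f) ** 2) + np.sum((f - P @ f) ** 2)
pythagoras_gap = abs(residual @ residual - split)
```

Both gaps should vanish up to rounding; the second one is the reason the term $\|(I - \mathcal{P})\bar{f}\|^2$ can be dropped from the minimization.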

This leads to the following convergence result. 

Theorem 1. Let us denote by $\tilde{f}_m$ the projection of $\bar{f}$ onto the
first $m$ principal components of the operator $\bar{K}$. We have

$$\|\mathcal{P}\bar{f} - \bar{f}_m\|_{\mathcal{L}_2(P_X)} \le \|\mathcal{P}\bar{f} - \tilde{f}_m\|_{\mathcal{L}_2(P_X)}.$$

In particular,

$$\bar{f}_m \to \mathcal{P}\bar{f} \quad \text{in } \mathcal{L}_2(P_X) \text{ as } m \to \infty.$$

This theorem is an extension of the finite-dimensional results by Phatak and
de Hoog (2002).



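Proof. We construct a sequence of polynomials $\tilde{q}_m$ of degree
$\le m - 1$ such that

$$\|\mathcal{P}\bar{f} - \bar{K}\tilde{q}_m(\bar{K})\bar{f}\| \le \|\mathcal{P}\bar{f} - \tilde{f}_m\|$$

and then exploit the minimization property of Proposition 1. Let us consider
the first $m$ eigenvalues $\lambda_1, \ldots, \lambda_m$ of the operator
$\bar{K}$ with corresponding eigenfunctions $\phi_1, \ldots, \phi_m$. Then, by
definition,

$$\tilde{f}_m = \sum_{i=1}^{m} \langle \bar{f}, \phi_i \rangle \phi_i = \sum_{i=1}^{m} \langle \mathcal{P}\bar{f}, \phi_i \rangle \phi_i .$$

The polynomial

$$p_m(\lambda) = \prod_{i=1}^{m} \frac{\lambda_i - \lambda}{\lambda_i} \tag{6}$$

fulfills $p_m(0) = 1$, hence it defines a polynomial $\tilde{q}_m$ of degree
$\le m - 1$ via $p_m(\lambda) = 1 - \lambda \tilde{q}_m(\lambda)$. As the zeroes
of $p_m$ are the first $m$ eigenvalues of $\bar{K}$, the polynomial
$\tilde{q}_m$ has the convenient property that it "cancels out" the first $m$
eigenfunctions, i.e.

$$\|\mathcal{P}\bar{f} - \bar{K}\tilde{q}_m(\bar{K})\bar{f}\|^2 = \sum_{i=m+1}^{\infty} p_m(\lambda_i)^2 \langle \bar{f}, \phi_i \rangle^2 .$$

By construction, $p_m(\lambda_i)^2 \le 1$ for $i > m$ (each factor
$(\lambda_j - \lambda_i)/\lambda_j$ with $j \le m < i$ lies in $[0, 1]$), and
hence

$$\|\mathcal{P}\bar{f} - \bar{K}\tilde{q}_m(\bar{K})\bar{f}\|^2 \le \sum_{i=m+1}^{\infty} \langle \bar{f}, \phi_i \rangle^2 = \|\mathcal{P}\bar{f} - \tilde{f}_m\|^2 .$$

As the principal components approximations $\tilde{f}_m$ converge to
$\mathcal{P}\bar{f}$, this concludes the proof. □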

As the rate of convergence of the population version is at least as good as
the rate of the principal components approximations, this theorem shows in
particular that the conjugate gradient method is less biased than Principal
Components Analysis. This fact is known for linear PLS in the empirical case
(De Jong, 1993; Phatak and de Hoog, 2002). By the same token, KPLS is
less biased than $L_2$-Boosting, as the latter corresponds to the fixed
polynomial $q_m(t) = \sum_{i=0}^{m-1} (1 - t)^i$. However, empirical findings
suggest that for KPLS, the decrease in bias is balanced by an increased
complexity in terms of degrees of freedom (Krämer and Braun, 2007). The goal
of the next section is to introduce a suitable control of this complexity.
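The bias comparison can be illustrated numerically in finite dimension. The following NumPy sketch is only an illustration under assumptions of our own: a random symmetric matrix `K` with eigenvalues in $(0, 1]$ (so that the projection onto its range is the identity), a random vector `f`, and helper names `cg_error`, `pcr_error`, `landweber_error`. It computes the best polynomial approximation of degree $\le m - 1$ by least squares and compares its error with the principal components error and with the error of the fixed Landweber polynomial, whose residual polynomial is $(1 - t)^m$.

```python
import numpy as np

rng = np.random.default_rng(3)
p = 10
Q, _ = np.linalg.qr(rng.standard_normal((p, p)))      # orthonormal eigenvectors
lam = np.sort(rng.uniform(0.05, 1.0, size=p))[::-1]   # eigenvalues, decreasing
K = Q @ np.diag(lam) @ Q.T
f = rng.standard_normal(p)
coef = Q.T @ f                                        # coefficients <f, phi_i>

def cg_error(m):
    """min over deg(q) <= m-1 of ||f - K q(K) f||, solved by least squares."""
    cols = [K @ f]
    for _ in range(m - 1):
        cols.append(K @ cols[-1])
    A = np.column_stack(cols)                         # K f, K^2 f, ..., K^m f
    w, *_ = np.linalg.lstsq(A, f, rcond=None)
    return np.linalg.norm(f - A @ w)

def pcr_error(m):
    """||f - projection of f onto the first m principal components||."""
    return np.linalg.norm(coef[m:])

def landweber_error(m):
    """Error for q_m(t) = sum_{i<m} (1 - t)^i, i.e. residual polynomial (1 - t)^m."""
    return np.linalg.norm((1.0 - lam) ** m * coef)

errors = [(m, cg_error(m), pcr_error(m), landweber_error(m)) for m in range(1, 6)]
for m, e_cg, e_pcr, e_lw in errors:
    print(f"m={m}: CG {e_cg:.4f}  PCR {e_pcr:.4f}  Landweber {e_lw:.4f}")
```

For every $m$ the first error should not exceed the other two, since both competitors correspond to particular polynomials of degree $\le m - 1$.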
4 CONSISTENT STOPPING RULES



4.1 ERROR MONITORING 

We control the error between the population case and the empirical case by
iteratively monitoring upper bounds on this error. Since these bounds only
involve known empirical quantities, we design a stopping criterion based
on them. This then leads to a globally consistent procedure. The key
ingredient of the stopping rule is to bound the differences for the iterates
$g_m$, $u_m$, $d_m$ (defined in algorithm 1) if we replace the empirical
quantities $S$ and $T^* y$ by their population versions. Note that algorithm 1
involves products and quotients of the perturbed quantities. The error control
based on these expressions can hence be represented in terms of the following
three functions.

Definition 2. For any positive reals $x > \delta_x \ge 0$ define

$$\zeta(x, \delta_x) = \frac{\delta_x}{x (x - \delta_x)}$$

and for any positive reals $(x, y, \delta_x, \delta_y)$ define

$$\xi(x, y, \delta_x, \delta_y) = x \delta_y + y \delta_x + \delta_x \delta_y ,$$
$$\xi'(x, y, \delta_x, \delta_y) = x \delta_y + y \delta_x .$$

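The usefulness of these definitions is justified by the following standard
lemma for bounding the approximation error of inverses and products, based
only on the knowledge of the approximant:

Lemma 1. Let $a$, $\bar{a}$ be two invertible elements of a Banach algebra
$\mathcal{B}$, with $\|a - \bar{a}\| \le \delta$ and
$\|a^{-1}\|^{-1} > \delta$. Then

$$\|a^{-1} - \bar{a}^{-1}\| \le \zeta(\|a^{-1}\|^{-1}, \delta).$$

Let $\mathcal{B}_1$, $\mathcal{B}_2$ be two Banach spaces and assume an
associative product operation exists from $\mathcal{B}_1 \times \mathcal{B}_2$
to a Banach space $\mathcal{B}_3$, satisfying for any
$(x_1, x_2) \in \mathcal{B}_1 \times \mathcal{B}_2$ the product compatibility
condition $\|x_1 x_2\| \le \|x_1\| \|x_2\|$. Let $\alpha$, $\bar{\alpha}$ in
$\mathcal{B}_1$ and $\beta$, $\bar{\beta} \in \mathcal{B}_2$, such that
$\|\alpha - \bar{\alpha}\| \le \delta_\alpha$ and
$\|\beta - \bar{\beta}\| \le \delta_\beta$. Then

$$\|\alpha\beta - \bar{\alpha}\bar{\beta}\| \le \xi(\|\alpha\|, \|\beta\|, \delta_\alpha, \delta_\beta).$$

In the same situation as above, if it is known that $\|\bar{\alpha}\| \le C$,
then

$$\|\alpha\beta - \bar{\alpha}\bar{\beta}\| \le \xi'(C, \|\beta\|, \delta_\alpha, \delta_\beta).$$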
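The bounds of the lemma can be sanity-checked on matrices equipped with the spectral norm, which is a Banach algebra norm. The sketch below makes two assumptions that should be stated plainly: the products $\xi$ and $\xi'$ are taken from the definition above, while the inverse bound is assumed to have the standard perturbation form $\zeta(x, \delta) = \delta / (x (x - \delta))$; the function names and the random test matrices are hypothetical.

```python
import numpy as np

def zeta(x, dx):
    """Assumed inverse bound dx / (x * (x - dx)), defined for x > dx >= 0."""
    if not x > dx >= 0:
        raise ValueError("zeta is only defined for x > dx >= 0")
    return dx / (x * (x - dx))

def xi(x, y, dx, dy):
    """Product bound when only the perturbed norms x, y are known."""
    return x * dy + y * dx + dx * dy

def xi_prime(x, y, dx, dy):
    """Product bound when x bounds the norm of the unperturbed first factor."""
    return x * dy + y * dx

def op_norm(A):
    return np.linalg.norm(A, 2)          # spectral norm (submultiplicative)

rng = np.random.default_rng(4)
a = np.eye(4) + 0.1 * rng.standard_normal((4, 4))     # "empirical" element
a_bar = a + 0.01 * rng.standard_normal((4, 4))        # "population" element
delta = op_norm(a - a_bar)
x = 1.0 / op_norm(np.linalg.inv(a))                   # ||a^{-1}||^{-1}

inv_err = op_norm(np.linalg.inv(a) - np.linalg.inv(a_bar))
inv_bound = zeta(x, delta)

b = rng.standard_normal((4, 4))
b_bar = b + 0.01 * rng.standard_normal((4, 4))
d_a, d_b = delta, op_norm(b - b_bar)
prod_err = op_norm(a @ b - a_bar @ b_bar)
prod_bound = xi(op_norm(a), op_norm(b), d_a, d_b)
prod_bound_C = xi_prime(op_norm(a_bar), op_norm(b), d_a, d_b)   # C = ||a_bar||
```

Note that `inv_bound` and `prod_bound` use only the perturbed quantities `a`, `b` and the deviation sizes, which is what makes such bounds usable for monitoring.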
Furthermore, we can bound the deviation of the 'starting values' $S$ and
$T^* y$:

Lemma 2. Set $\varepsilon_n = 4\sqrt{(\log n)/n}$. If the kernel is bounded
(B), with probability at least $1 - n^{-2}$,

$$\|T^* y - \bar{T}^* \bar{f}\| \le \varepsilon_n \quad \text{and} \quad \|S - \bar{S}\| \le \varepsilon_n .$$

The second bound is well-known, see e.g. Shawe-Taylor and Cristianini (2003);
Zwald and Blanchard (2006), and the first one is based on the same argument.

The error monitoring updates corresponding to algorithm 1 are displayed
in algorithm 2. Note that the error monitoring initialization and update only
depend on observable quantities.

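Algorithm 2 Error Control for Algorithm 1

Initialize: $\delta^g_0 = 0$; $\varepsilon_n = 4\sqrt{(\log n)/n}$; $\delta^u_0 = \delta^d_0 = \varepsilon_n$
Initialize: $\varepsilon_{0,4} = \xi(\|u_0\|, \|u_0\|, \delta^u_0, \delta^u_0)$
for $m = 0, \ldots, (m_{\max} - 1)$ do
    $\varepsilon_{m,1} = \xi'(\|d_m\|, 1, \delta^d_m, \varepsilon_n)$
    $\varepsilon_{m,2} = \xi(\|d_m\|, \|S d_m\|, \delta^d_m, \varepsilon_{m,1})$
    $\varepsilon_{m,3} = \zeta(\langle d_m, S d_m \rangle, \varepsilon_{m,2})$ (if defined, else exit)
    $\delta^\alpha_m = \xi(\|u_m\|^2, \langle d_m, S d_m \rangle^{-1}, \varepsilon_{m,4}, \varepsilon_{m,3})$
    $\delta^g_{m+1} = \delta^g_m + \xi(\alpha_m, \|d_m\|, \delta^\alpha_m, \delta^d_m)$
    $\delta^u_{m+1} = \delta^u_m + \xi(\alpha_m, \|S d_m\|, \delta^\alpha_m, \varepsilon_{m,1})$
    $\varepsilon_{m,5} = \zeta(\|u_m\|^2, \varepsilon_{m,4})$ (if defined, else exit)
    $\varepsilon_{m+1,4} = \xi(\|u_{m+1}\|, \|u_{m+1}\|, \delta^u_{m+1}, \delta^u_{m+1})$
    $\delta^\beta_m = \xi(\|u_{m+1}\|^2, \|u_m\|^{-2}, \varepsilon_{m+1,4}, \varepsilon_{m,5})$
    $\delta^d_{m+1} = \delta^u_{m+1} + \xi(\beta_m, \|d_m\|, \delta^\beta_m, \delta^d_m)$
end for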



Definition 3 (First stopping rule). Fix $0 < \gamma < \frac{1}{2}$ and run the
KPLS algorithm 1 along with the error monitoring algorithm 2. Let
$m_{(n)} + 1$ denote the first time where either the procedure exits, or
$\delta^g_{m_{(n)}+1} > n^{-\gamma}$. Here, the subscript $(n)$ indicates that
the step is data-dependent. Output the estimate at the previous step, that is,
the estimate $f^{(n)} = T g_{m_{(n)}}$.

The next theorem states that this stopping rule leads to a universally 
consistent learning rule for bounded regression. 
Theorem 2. Assume that the kernel is bounded (B) and universal (U).
Denote by $f^{(n)}$ the output of the KPLS algorithm run on $n$ independent
observations while using the above stopping rule. Then almost surely

$$\lim_{n \to \infty} \|\bar{f} - f^{(n)}\|_{\mathcal{L}_2(P_X)} = 0 .$$



Proof. By construction $f^{(n)} = T g^{(n)}_{m_{(n)}}$. We have

$$\|\bar{f} - f^{(n)}\|_{\mathcal{L}_2(P_X)} \le \|\bar{f} - \bar{f}_{m_{(n)}}\|_{\mathcal{L}_2(P_X)} + \|\bar{f}_{m_{(n)}} - f^{(n)}\|_{\mathcal{L}_2(P_X)} .$$

We proceed in two steps. First, we establish that the second term is bounded
by $n^{-\gamma}$. Second, we prove that the random variable
$m_{(n)} \to \infty$ almost surely for $n \to \infty$, which ensures that the
first term goes to zero (using Theorem 1).

First Step: We have

$$\|\bar{f}_m - f^{(n)}_m\|_{\mathcal{L}_2(P_X)} = \|\bar{g}_m - g_m\|_{\mathcal{L}_2(P_X)} \le \|g_m - \bar{g}_m\|_\infty \le \|g_m - \bar{g}_m\|_{\mathcal{H}_k} .$$

(We drop the subscript $(n)$ of $m$ to lighten notation.) Now, we prove that
the construction of the error monitoring iterates ensures that
$\|g_m - \bar{g}_m\| \le \delta^g_m$ for any $m$ before stopping, with
probability at least $1 - 2n^{-2}$. For $m = 0$, this follows immediately from
Lemma 2. For a general $m$, it is then straightforward to show that
$\varepsilon_{m,1}$ controls the error for $S d_m$ (using
$\|\bar{S}\| \le 1$); $\varepsilon_{m,2}$ controls the error for
$\langle d_m, S d_m \rangle$ (the Cauchy-Schwarz inequality ensuring the
compatibility condition of norm and product in Lemma 1); $\varepsilon_{m,3}$
controls the error for the inverse of the latter; $\delta^\alpha_m$,
$\delta^g_m$, $\delta^u_m$, $\delta^\beta_m$, $\delta^d_m$ control the errors
for the respective superscript quantities; $\varepsilon_{m,4}$ controls the
error for $\|u_m\|^2$ and $\varepsilon_{m,5}$ for the inverse of the latter.
In particular, with probability at least $1 - 2n^{-2}$ we have
$\|g_{m_{(n)}} - \bar{g}_{m_{(n)}}\| \le \delta^g_{m_{(n)}}$, so that
$\|g_{m_{(n)}} - \bar{g}_{m_{(n)}}\| \le n^{-\gamma}$ by definition of the
stopping rule. Since the probabilities that this inequality is violated are
summable, by the Borel-Cantelli lemma, almost surely this inequality is true
from a certain rank $n_0$ on.

Second Step: This is in essence a minutely detailed continuity argument.
Due to space limitations, we only sketch the proof. We consider a fixed
iteration $m$ and prove that this iteration is reached without exiting and that
$\delta^{g,(n)}_m \le C(m)\varepsilon_n$ almost surely from a certain rank
$n_0$ on (here the superscript $(n)$ once again recalls the dependence on the
number of data). The constant $C(m)$ is deterministic but can depend on the
generating distribution. Since $\varepsilon_n$ decays faster than
$n^{-\gamma}$, this ensures $m_{(n)} \to \infty$ almost surely. We prove by
induction that this property is true for all error controlling quantities
$\varepsilon_{m,\cdot}$ and $\delta^{\cdot}_m$ appearing in the error
monitoring iterates. From the initialization, this is true for $\delta^u_0$,
$\delta^d_0$ and $\varepsilon_n$. In a nutshell, the induction step then
follows from the fact that the error monitoring functions $\zeta$, $\xi$,
$\xi'$ are locally Lipschitz. □

4.2 EMPIRICAL COMPLEXITY 

We now propose an alternate stopping rule directly based on the
characterizing property (5) of KPLS/CG. This approach leads to a more explicit
stopping rule. Let us define the operator
$R_m : \mathbb{R}^m \to \mathcal{H}_k$ by

$$v = (v_1, \ldots, v_m) \mapsto R_m v = \sum_{i=1}^{m} v_i S^{i-1} T^* y . \tag{7}$$

Then, we can rewrite the $m$-th iterate of KPLS/CG as $g_m = R_m w_m$, and (5)
becomes

$$w_m = \arg\min_{w \in \mathbb{R}^m} \|y - T R_m w\|^2 .$$

From this, it can be deduced by standard arguments that

$$g_m = R_m M_m^{-1} R_m^* T^* y , \qquad M_m = R_m^* S R_m . \tag{8}$$

The random $m \times m$ matrix $M_m$ has $(i, j)$ entry

$$(M_m)_{ij} = y^\top T S^{i-1} S S^{j-1} T^* y = y^\top K^{i+j} y ,$$

where $K = T T^*$ can be identified with the kernel Gram matrix, i.e.
$K_{k\ell} = k(x_k, x_\ell)$. Similarly, denote by $M'_m$ the matrix with
$(i, j)$ entry equal to $y^\top K^{i+j-1} y$.

Definition 4 (Second stopping rule). Fix $\frac{1}{2} > \nu > 0$ and let
$m'_{(n)} + 1$ denote the first time where

$$m \left( \max(\|M'_m\|, m^{-1}) \, \|M_m^{-1}\| \right)^2 > n^{\nu} .$$

(If $M_m$ is singular, we set $\|M_m^{-1}\| = \infty$.) Output the KPLS
estimate at step $m'_{(n)}$.

Note that the stopping criterion only depends on empirical quantities.
Furthermore, we can interpret the stopping criterion as an empirical
complexity control of the CG algorithm at step $m$. Essentially, it is the
dimensionality (or iteration number) $m$ times the square of the
"pseudo-condition number" $\|M'_m\| \, \|M_m^{-1}\|$.
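The second stopping rule only needs the moments $y^\top K^p y$, so it is cheap to evaluate. The sketch below is a hedged illustration, not the authors' implementation. It assumes the normalization $K = G/n$ with $G$ the Gram matrix and a factor $1/n$ in the empirical inner product, which is our reading of the definition $T^* g = \frac{1}{n}\sum_i g(x_i) k(x_i, \cdot)$; the function names `moment_matrices` and `second_rule_stop`, the RBF kernel and the synthetic data are hypothetical.

```python
import numpy as np

def moment_matrices(G, y, m):
    """Return (M_m, M'_m) with entries y^T K^(i+j) y / n and y^T K^(i+j-1) y / n.

    Assumption of this sketch: K = G / n, and inner products in L2(P_{X,n})
    carry a factor 1/n.
    """
    n = len(y)
    K = G / n
    moments = []                        # moments[p] = y^T K^p y / n, p = 0..2m
    v = y.copy()
    for _ in range(2 * m + 1):
        moments.append(y @ v / n)
        v = K @ v
    idx = range(1, m + 1)
    M = np.array([[moments[i + j] for j in idx] for i in idx])
    M_prime = np.array([[moments[i + j - 1] for j in idx] for i in idx])
    return M, M_prime

def second_rule_stop(G, y, nu, m_max):
    """Last step m <= m_max before the criterion first exceeds n**nu."""
    n = len(y)
    for m in range(1, m_max + 1):
        M, M_prime = moment_matrices(G, y, m)
        try:
            inv_norm = np.linalg.norm(np.linalg.inv(M), 2)
        except np.linalg.LinAlgError:
            inv_norm = np.inf           # singular M_m: ||M_m^{-1}|| = infinity
        A = max(np.linalg.norm(M_prime, 2), 1.0 / m)
        if m * (A * inv_norm) ** 2 > n ** nu:
            return m - 1
    return m_max

rng = np.random.default_rng(5)
n = 60
X = rng.uniform(-1.0, 1.0, size=(n, 1))         # synthetic inputs (assumption)
G = np.exp(-4.0 * (X - X.T) ** 2)               # RBF Gram matrix, k(x, x) = 1
y = np.sin(3.0 * X[:, 0]) + 0.1 * rng.standard_normal(n)
y = y - y.mean()
y = y / np.max(np.abs(y))                       # enforce |y_i| <= 1 as in (B)

m_stop = second_rule_stop(G, y, nu=0.25, m_max=5)
M3, M3_prime = moment_matrices(G, y, 3)
```

Computing the moments by repeated matrix-vector products avoids forming powers of $K$ explicitly. Under assumption (B) the difference `M3_prime - M3` is positive semidefinite and the norm of `M3_prime` is at most 3, which gives a quick consistency check on the construction.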

Theorem 3. Assume that the kernel is bounded (B) and universal (U).
Denote by $f^{(n)}$ the output of the KPLS algorithm run on $n$ independent
observations while using the above stopping rule. Then almost surely

$$\lim_{n \to \infty} \|\bar{f} - f^{(n)}\|_{\mathcal{L}_2(P_X)} = 0 .$$
Proof. We prove this result by exploiting the explicit formula (8) for KPLS.
Note that the formula is also true for the population version $\bar{g}_m$ if
we replace all empirical quantities by their population counterparts. Hence,
we need to control the deviation of the different terms appearing in the above
formula (8). This boils down to perturbation analysis and is closely related to
techniques employed by Chun and Keles (2009) in a finite dimensional context.
The following lemma summarizes the deviation control (the proof can be
found in the appendix).

Lemma 3. Put $\varepsilon_n = 4\sqrt{(\log n)/n}$. There is an absolute
numerical constant $c$ such that, with probability at least $1 - n^{-2}$,
whenever

$$c\, m\, \varepsilon_n \left( \max(\|M'_m\|, m^{-1}) \, \|M_m^{-1}\| \right)^2 \le 1 ,$$

then

$$\|g_m - \bar{g}_m\| \le c\, m\, \varepsilon_n \left( \max(\|M'_m\|, m^{-1}) \, \|M_m^{-1}\| \right)^2 .$$

Now, let $c$ be the universal constant appearing in Lemma 3. At the
stopping step $m = m'_{(n)}$, by construction,

$$c\, m\, \varepsilon_n \left( \max(\|M'_m\|, m^{-1}) \, \|M_m^{-1}\| \right)^2 \le 4 c\, n^{\nu - \frac{1}{2}} \sqrt{\log n} .$$

Choosing some positive $\gamma < \frac{1}{2} - \nu$, for $n$ large enough the
right-hand side is bounded by $n^{-\gamma}$. By Lemma 3 and a reasoning similar
to that of the proof of Theorem 2, this implies the bound on the estimation
error with respect to the population version (valid with probability at least
$1 - n^{-2}$):

$$\|\bar{f}_{m'_{(n)}} - f_{m'_{(n)}}\|_{\mathcal{L}_2(P_X)} \le n^{-\gamma} .$$

By the Borel-Cantelli lemma, this inequality is therefore almost surely
satisfied for big enough $n$.

To conclude, we establish that $m'_{(n)} \to \infty$ almost surely as
$n \to \infty$. It suffices to show that for any fixed $m$, for $n$ larger than
a certain $n_0(m)$,
$m \left( \max(\|M'_m\|, m^{-1}) \, \|M_m^{-1}\| \right)^2 < n^{\nu}$,
implying that almost surely $m'_{(n)} \ge m$ for $n$ large enough. For this we
simply show that the left-hand side of this inequality converges a.s. to a
fixed number. From (9) in the proof of Lemma 3 and the straightforward
inequality $\|M'\| \le m$, we see that for a fixed iteration $m$,
$\|M_m - \bar{M}_m\| \to 0$ almost surely as $n \to \infty$. The matrix
$\bar{M}_m$ is non-singular as we assume that the population version of the
algorithm does not exit. This implies that $M_m$ is almost surely non-singular
for $n$ big enough with $\|M_m^{-1} - \bar{M}_m^{-1}\| \to 0$, and therefore
almost surely $\|M_m^{-1}\|$ converges to $\|\bar{M}_m^{-1}\|$, a fixed
number. □
5 CONCLUSION



In this paper, we proposed two stopping rules for the number m of KPLS
iterations that lead to universally consistent estimates. Both are based on
the facts that (a) population KPLS is defined in terms of the covariance
operator $\bar{S}$ and the cross-covariance $\bar{T}^* \bar{f}$, and that (b)
we can control the difference between population KPLS and empirical KPLS
solutions based on the discrepancy of these covariances to their empirical
counterparts $S$ and $T^* y$. While the first stopping rule monitors the
estimation error by following the iterative KPLS algorithm 1, the second
stopping rule uses a closed-form representation and is expressed in terms of a
pseudo-condition number. Neither rule requires any prior knowledge of the
target function, and both can be computed from the data.

Our approach makes heavy use of the equivalence of PLS to the conjugate
gradient algorithm applied to the normal equations in combination with early
stopping. This framework also connects KPLS to statistical inverse problems.
In this context, KPLS differs from previously well-studied classes of
methods. In particular, as KPLS is not linear in $y$, it contrasts with the
class of linear spectral methods for statistical inverse problems (e.g.
Bissantz et al., 2007; Lo Gerfo et al., 2008). This class considers estimates
of the form

$$g_\lambda = F_\lambda(S) T^* y$$

in order to regularize the ill-posed problem (4). Here, $F_\lambda$ is a fixed
family of "filter functions". Examples include Ridge Regression (also known as
Tikhonov regularization), Principal Components Regression (also known as
spectral cut-off), and $L_2$-Boosting (which corresponds to Landweber
iteration). In this paper, we explained that for KPLS with $m$ components, the
filter function $F_\lambda$ is a polynomial of degree $\le m - 1$, but that,
unlike the class of linear spectral methods, this polynomial strongly depends
on the response. For this reason, general results for linear spectral methods
do not directly apply to KPLS, and additional techniques are needed.

For linear spectral methods, optimal convergence rates of the resulting
estimators have been established (…, 2007; Bauer et al., 2007; Caponnetto,
2006); the focus of future theoretical effort on KPLS will be to establish that
this algorithm can attain these optimal rates as well.
A PROOF OF LEMMA 3

The letter $c$ denotes an absolute numerical constant whose value can possibly
be different from line to line.

We define $\bar{R}_m$ as the population analogue of $R_m$, by replacing $S$ and
$T^* y$ in (7) by $\bar{S}$ and $\bar{T}^* \bar{f}$. The population version of
KPLS is then

$$\bar{g}_m = \bar{R}_m \bar{M}_m^{-1} \bar{R}_m^* \bar{T}^* \bar{f} ,$$

where $\bar{M}_m = \bar{R}_m^* \bar{S} \bar{R}_m$. Since the index $m$ of the
iteration is now fixed, we omit the subscript $m$ in $R$ and $M$. Set
$A = \max(\|M'\|, m^{-1})$. We have $M' = R^* R$, so that
$\|M'\| = \|R\|^2$. Observe that

$$M' - M = R^* (\mathrm{Id} - S) R$$

is a positive-semidefinite matrix since $\|S\| \le 1$; hence
$\|M'\| \ge \|M\| \ge \|M^{-1}\|^{-1}$, and the assumption that
$C \varepsilon_n m \left( A \|M^{-1}\| \right)^2 \le 1$ with $C \ge 1$ implies
in particular that $m \varepsilon_n \le 1$.

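We first control the difference $\rho = \|R - \bar{R}\|$:

$$\begin{aligned}
\rho &= \sup_{v : \|v\| = 1} \Big\| \sum_{i=1}^{m} v_i \left( S^{i-1} T^* y - \bar{S}^{i-1} \bar{T}^* \bar{f} \right) \Big\| \\
&\le \sup_{v : \|v\| = 1} \Big\| \sum_{i=1}^{m} v_i \left( S^{i-1} - \bar{S}^{i-1} \right) T^* y \Big\|
 + \sup_{v : \|v\| = 1} \Big\| \sum_{i=1}^{m} v_i \bar{S}^{i-1} \left( T^* y - \bar{T}^* \bar{f} \right) \Big\| .
\end{aligned}$$

Using Lemma 2, the second term on the right-hand side can be bounded by
$\sum_{i=1}^{m} |v_i| \varepsilon_n \le \sqrt{m}\, \varepsilon_n$. For the first
term, one can rewrite

$$\begin{aligned}
\sup_{v : \|v\| = 1} \Big\| \sum_{i=1}^{m} v_i \left( S^{i-1} - \bar{S}^{i-1} \right) T^* y \Big\|
&= \sup_{v : \|v\| = 1} \Big\| \sum_{i=2}^{m} v_i \sum_{j=1}^{i-1} \bar{S}^{j-1} (S - \bar{S}) S^{i-1-j} T^* y \Big\| \\
&= \sup_{v : \|v\| = 1} \Big\| \sum_{j=1}^{m-1} \bar{S}^{j-1} (S - \bar{S}) \sum_{i=j+1}^{m} v_i S^{i-1-j} T^* y \Big\| \\
&\le \sum_{j=1}^{m-1} \varepsilon_n \sup_{v : \|v\| = 1} \Big\| \sum_{i=1}^{m-j} v_{i+j} S^{i-1} T^* y \Big\| \\
&\le m\, \varepsilon_n \|R\| .
\end{aligned}$$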



This implies

$$\|R - \bar{R}\| \le \sqrt{m}\, \varepsilon_n + m\, \varepsilon_n \|R\| \le 2 m\, \varepsilon_n A^{1/2} .$$



We now repeatedly use Lemma 1 to derive the final estimate from the above
one using product and inverse operations. We omit some tedious details in
the computations below.

Recalling $M = R^* S R$ and using $m \varepsilon_n \le 1$, we deduce

$$\|M - \bar{M}\| \le c\, m\, \varepsilon_n A . \tag{9}$$

We then have

$$\|M^{-1} - \bar{M}^{-1}\| \le c\, m\, \varepsilon_n A\, \|M^{-1}\|^2 .$$

Here, we assume that $c\, m\, \varepsilon_n A \le \|M^{-1}\|^{-1} / 2$, which
is implied by the assumption
$C \varepsilon_n m A^2 \|M^{-1}\|^2 \le 1$ if $C$ is chosen big enough.
Finally, we get

$$\|R M^{-1} R^* - \bar{R} \bar{M}^{-1} \bar{R}^*\| \le c\, m\, \varepsilon_n A^2 \|M^{-1}\|^2 .$$
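The bound on $\|R - \bar{R}\|$ in this appendix rests on comparing powers of $S$ and $\bar{S}$. A telescoping identity that yields such a comparison is $S^q - \bar{S}^q = \sum_{j=1}^{q} \bar{S}^{j-1} (S - \bar{S}) S^{q-j}$, which gives $\|S^q - \bar{S}^q\| \le q \|S - \bar{S}\|$ when both operators have norm at most 1. The NumPy sketch below checks both statements on random symmetric matrices; the matrices and the helper name are purely illustrative and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(6)

def random_psd_with_norm(p, norm):
    """Random symmetric positive semidefinite matrix with the given operator norm."""
    B = rng.standard_normal((p, p))
    A = B @ B.T
    return norm * A / np.linalg.norm(A, 2)

p, q = 5, 4
S_bar = random_psd_with_norm(p, 0.9)            # stand-in for the population operator
E = 1e-3 * rng.standard_normal((p, p))
S = S_bar + (E + E.T) / 2.0                     # symmetric perturbation, norm stays below 1
eps = np.linalg.norm(S - S_bar, 2)

power = np.linalg.matrix_power
direct = power(S, q) - power(S_bar, q)
telescoped = sum(
    power(S_bar, j - 1) @ (S - S_bar) @ power(S, q - j) for j in range(1, q + 1)
)
telescoping_gap = np.max(np.abs(direct - telescoped))
power_err = np.linalg.norm(direct, 2)           # should be at most q * eps
```

Each summand has norm at most `eps` because the surrounding powers are contractions, which is where the factor $m$ in the bound $m \varepsilon_n \|R\|$ comes from.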